Mining Acknowledgements Sections

Author

Eva Maxfield Brown and Chris Fu

Published

October 28, 2022

Introduction

While the current scholarly effort of literature review focuses on understanding a published work’s vision, content, methods, results, limitations, etc., we aim to find meaningful information in the acknowledgements section of research papers. Although an acknowledgements section appears in most research papers, as far as we know, it has not gathered much interest. We want to understand the different elements of the acknowledgements section, how they are organized, and whether, within a specific field, there are frequently mentioned names and entities. In addition, we discuss how these findings could be incorporated to present helpful information to users when they search for related research interests.

Original Dataset

The original dataset of 64 papers was provided to us as a large JSON file containing a lot of data. For our analysis of acknowledgements sections we only needed a few data points to get started. The original dataset is available below for exploration (with a minor change just to make it render nicely).

Show Code for Loading the Original Dataset
from IPython.display import JSON
import json

with open("data/599_lit_review.json", "r") as open_f:
    original_dataset = json.load(open_f)
    
JSON({"data": original_dataset})
<IPython.core.display.JSON object>

Compiled Dataset

For our analysis, we really only needed some metadata and a view or download link for each paper, from which we could manually copy and paste any acknowledgements section into our dataset (we have some thoughts on how to automate this in a later section).

To extract the data we needed, we ran the following code:

Show Code for Compile Dataset for Manual Addition
import pandas as pd

compiled_rows = []
for index, paper in enumerate(original_dataset):
    # Some papers have data from CSL and some from S2
    # Get both so we don't really have to care later on
    
    # Check if the paper has CSL data at all
    if paper.get("csl", None) is not None:
        # Find or get title and url returned by CSL data
        csl_title = paper["csl"].get("title", None)
        csl_url = paper["csl"].get("URL", None)
    else:
        csl_title = None
        csl_url = None

    # Check if the paper has Semantic Scholar data at all
    if paper.get("s2data", None) is not None:
        # Find or get title and url returned by S2 data
        s2_title = paper["s2data"].get("title", None)
        s2_url = paper["s2data"].get("url", None)
    else:
        s2_title = None
        s2_url = None
    
    # Compile all results
    compiled_rows.append({
        "paper_index": index,
        "doi": paper["doi"],
        "s2id": paper.get("s2id", None),
        "s2_url": s2_url,
        "csl_url": csl_url,
        "s2_title": s2_title,
        "csl_title": csl_title,
        "acknowledgements_text": None,
    })
    
compiled_dataset = pd.DataFrame(compiled_rows)
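
The manual copy-paste step above could potentially be automated. A rough sketch, assuming we already have each paper's full text as a plain string (real PDFs would first need text extraction), is to look for the acknowledgements heading with a regular expression:

```python
import re

# Match "Acknowledgements" / "Acknowledgments" on its own line, then capture
# everything up to the next blank line (or end of text) as the section body.
ACK_PATTERN = re.compile(
    r"acknowledge?ments?\s*\n(?P<body>.+?)(?:\n\s*\n|\Z)",
    re.IGNORECASE | re.DOTALL,
)

def extract_acknowledgements(full_text: str):
    """Return the acknowledgements section body, or None if not found."""
    match = ACK_PATTERN.search(full_text)
    return match.group("body").strip() if match else None

# Hypothetical full text for illustration
sample = "Results\n...\nAcknowledgements\nWe thank the NSF for funding.\n\nReferences"
print(extract_acknowledgements(sample))  # We thank the NSF for funding.
```

This would miss papers whose section heading is phrased differently (e.g. "Funding"), so manual review would still be needed for stragglers.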

Our dataset after adding all the acknowledgements sections is available below:

Read and Show Data with Acknowledgements Sections Added
from itables import show
import itables.options as table_opts
table_opts.lengthMenu = [5, 10, 25, 50]

raw_data = pd.read_csv("data/raw-ack-sections.csv")
show(raw_data)
(Interactive table with columns: paper_index, doi, s2id, s2_url, csl_url, s2_title, csl_title, acknowledgements_text)

NER

We can now take each of these acknowledgements sections and run them through a named entity recognition model.

import spacy
# import spacy_transformers

nlp = spacy.load("en_core_web_trf")

# Filter dataset to only include rows with acknowledgements sections
filtered_data = raw_data.dropna(subset=["acknowledgements_text"])

# For each acknowledgement, run it through spacy,
# extract entities and their labels and store to a dataframe
entities_rows = []
docs = []
for _, paper in filtered_data.iterrows():
    doc = nlp(paper.acknowledgements_text)
    docs.append(doc)
    for ent in doc.ents:
        # Store with the DOI so we can join with other data later
        entities_rows.append({
            "doi": paper.doi,
            "entity": ent.text,
            "entity_label": ent.label_,
        })
        
entities = pd.DataFrame(entities_rows)
# Combine names that have different representations
# entities[entities["entity_label"] == "ORG"]["entity"].unique()

entities['entity'] = entities['entity'].replace(['the National Science Foundation', 'National Science Foundation'], 'NSF')
entities['entity'] = entities['entity'].replace(['Ofce of Naval Research', 'the Office of Naval Research'], 'ONR')
entities['entity'] = entities['entity'].replace(['the ArmyResearch  Laboratory', 'the  Army  Research  Labora-tory'], 'the Army Research Laboratory')
entities['entity'] = entities['entity'].replace(['AI2'], 'the Allen Institute for AI')
entities['entity'] = entities['entity'].replace(['the University of\nWashington'], 'the University of Washington')
entities['entity'] = entities['entity'].replace(['the Alfred P. Sloan Foun- dation', 'the Alfred\nP. Sloan Foundation'], 'the Alfred P. Sloan Foundation')
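
The chain of replace calls above works, but the same normalization can also be expressed as a single mapping, since pandas' `Series.replace` accepts a dict of old → new values. A minimal sketch, using a few of the same variant strings (the sample dataframe is hypothetical):

```python
import pandas as pd

# Hypothetical entity column with OCR / line-break variants of the same names
entities = pd.DataFrame({
    "entity": [
        "National Science Foundation",
        "the National Science Foundation",
        "the Office of Naval Research",
        "AI2",
    ]
})

# One canonical name per variant, applied in a single pass
normalization_map = {
    "the National Science Foundation": "NSF",
    "National Science Foundation": "NSF",
    "the Office of Naval Research": "ONR",
    "AI2": "the Allen Institute for AI",
}

entities["entity"] = entities["entity"].replace(normalization_map)
print(entities["entity"].tolist())  # ['NSF', 'NSF', 'ONR', 'the Allen Institute for AI']
```

Keeping the variants in one dict makes it easier to review and extend the normalization as new papers are added.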
# How did the model tag each of these examples?
from ipywidgets import interact
from IPython.display import display, HTML
from spacy import displacy

@interact
def render_example(doc_index=list(range(len(docs)))):
    return display(HTML(displacy.render(docs[doc_index], style="ent")))

Here are the most common entity types:

import altair as alt

alt.Chart(entities).mark_bar().encode(
    alt.X("entity_label", sort="-y"),
    y="count()",
    color="entity_label",
    tooltip=["entity_label", "count()"],
).properties(
    width=400,
    height=300
).interactive()

The bulk of the named entities are people and organizations (which is what we would expect and what we are looking for), so we can filter out the rest.

# Filter all rows that aren't people or orgs
people_and_org_refs = entities.loc[entities.entity_label.isin(["PERSON", "ORG"])]

This is still too much data to visualize each person or organization’s count, so let’s just visualize the top ten referenced people or organizations.

# Top ten entities, including both people and organizations
top_ten_entities = people_and_org_refs.value_counts(
    subset=["entity", "entity_label"]
).to_frame().reset_index().rename(columns={0: "count"})[:10]

# Top ten entities, only including organizations
top_ten_entities_organization = people_and_org_refs[
    people_and_org_refs.entity_label == "ORG"
].value_counts(
    subset=["entity", "entity_label"]
).to_frame().reset_index().rename(columns={0: "count"})[:10]
# Visualize the top ten entities, including people and organizations
alt.Chart(top_ten_entities).mark_bar().encode(
    alt.X("entity", sort="-y"),
    y="count",
    color="entity",
    tooltip=["entity", "entity_label", "count"],
).properties(
    width=400,
    height=300
).interactive()
# Only visualize the top ten organizations
alt.Chart(top_ten_entities_organization).mark_bar().encode(
    alt.X("entity", sort="-y"),
    y="count",
    color="entity",
    tooltip=["entity", "entity_label", "count"],
).properties(
    width=400,
    height=300
).interactive()

Classifying Recognition

blah

Incorporate organization information in Scholar Search Engine

The organization names we extracted from the acknowledgements sections can provide funding information for researchers. They also tell users which organizations and institutes are interested in related research topics. Here we discuss several ways we could use this information in different parts of the search process.

  1. On the Search Page

[Figure: search page mockup showing funding information]

  2. Related Funding Opportunities
    balah balah
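
As a sketch of the first idea, the entity table we built earlier can be turned into a DOI → funders lookup that a search backend could join against its results. The column names follow the `entities` dataframe above; the sample rows and DOIs are hypothetical:

```python
import pandas as pd

# Hypothetical rows in the shape of the `entities` dataframe built earlier
entities = pd.DataFrame([
    {"doi": "10.1000/abc", "entity": "NSF", "entity_label": "ORG"},
    {"doi": "10.1000/abc", "entity": "Jane Doe", "entity_label": "PERSON"},
    {"doi": "10.1000/xyz", "entity": "ONR", "entity_label": "ORG"},
])

# Keep only organizations and group them per paper
funders_by_doi = (
    entities[entities.entity_label == "ORG"]
    .groupby("doi")["entity"]
    .apply(list)
    .to_dict()
)

# A search result page could now annotate each hit with its funders
print(funders_by_doi["10.1000/abc"])  # ['NSF']
```

A lookup like this could back both search-page annotations and a "related funding opportunities" panel, though in practice the organization labels would first need the normalization step described above.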